# You should download and unzip "fastBCR-main.zip" from "https://github.com/ZhangLabTJU/fastBCR" and set it as working directory before running pipeline.
# knitr::opts_knit$set(root.dir = "THE/PATH/TO/FASTBCR/FOLDER")
knitr::opts_knit$set(root.dir = "~/Documents/Rpackage/fastBCR-main") # Replace with your path.
library(fastBCR)
Loading required package: dplyr
Warning: package ‘dplyr’ was built under R version 4.2.3
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: ggplot2
Loading required package: ggmsa
Registered S3 methods overwritten by 'ggalt':
  method                  from   
  grid.draw.absoluteGrob  ggplot2
  grobHeight.absoluteGrob ggplot2
  grobWidth.absoluteGrob  ggplot2
  grobX.absoluteGrob      ggplot2
  grobY.absoluteGrob      ggplot2
ggmsa v1.4.0  Document: http://yulab-smu.top/ggmsa/

If you use ggmsa in published research, please cite:
L Zhou, T Feng, S Xu, F Gao, TT Lam, Q Wang, T Wu, H Huang, L Zhan, L Li, Y Guan, Z Dai*, G Yu* ggmsa: a visual exploration tool for multiple sequence alignment and associated data. Briefings in Bioinformatics. DOI:10.1093/bib/bbac222
Loading required package: msa
Loading required package: Biostrings
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl,
    intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:dplyr’:

    first, rename

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

Loading required package: IRanges

Attaching package: ‘IRanges’

The following objects are masked from ‘package:dplyr’:

    collapse, desc, slice

Loading required package: XVector
Loading required package: GenomeInfoDb

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

    strsplit

Loading required package: statnet
Loading required package: tergm
Loading required package: ergm
Warning: package ‘ergm’ was built under R version 4.2.3Loading required package: network
Warning: package ‘network’ was built under R version 4.2.3
‘network’ 1.18.2 (2023-12-04), part of the Statnet Project
* ‘news(package="network")’ for changes since last version
* ‘citation("network")’ for citation information
* ‘https://statnet.org’ for help, support, and other information


‘ergm’ 4.6.0 (2023-12-17), part of the Statnet Project
* ‘news(package="ergm")’ for changes since last version
* ‘citation("ergm")’ for citation information
* ‘https://statnet.org’ for help, support, and other information

‘ergm’ 4 is a major update that introduces some backwards-incompatible changes. Please type ‘news(package="ergm")’ for a list of major changes.

Loading required package: networkDynamic
Warning: package ‘networkDynamic’ was built under R version 4.2.3
‘networkDynamic’ 0.11.4 (2023-12-10?), part of the Statnet Project
* ‘news(package="networkDynamic")’ for changes since last version
* ‘citation("networkDynamic")’ for citation information
* ‘https://statnet.org’ for help, support, and other information

Registered S3 method overwritten by 'tergm':
  method                   from
  simulate_formula.network ergm

‘tergm’ 4.2.0 (2023-05-30), part of the Statnet Project
* ‘news(package="tergm")’ for changes since last version
* ‘citation("tergm")’ for citation information
* ‘https://statnet.org’ for help, support, and other information


Attaching package: ‘tergm’

The following object is masked from ‘package:ergm’:

    snctrl

Loading required package: ergm.count

‘ergm.count’ 4.1.1 (2022-05-24), part of the Statnet Project
* ‘news(package="ergm.count")’ for changes since last version
* ‘citation("ergm.count")’ for citation information
* ‘https://statnet.org’ for help, support, and other information

Loading required package: sna
Warning: package ‘sna’ was built under R version 4.2.3Loading required package: statnet.common

Attaching package: ‘statnet.common’

The following object is masked from ‘package:ergm’:

    snctrl

The following object is masked from ‘package:Biostrings’:

    order

The following object is masked from ‘package:XVector’:

    order

The following object is masked from ‘package:IRanges’:

    order

The following object is masked from ‘package:S4Vectors’:

    order

The following object is masked from ‘package:BiocGenerics’:

    order

The following objects are masked from ‘package:base’:

    attr, order

sna: Tools for Social Network Analysis
Version 2.7-2 created on 2023-12-05.
copyright (c) 2005, Carter T. Butts, University of California-Irvine
 For citation information, type citation("sna").
 Type help(package="sna") to get started.

Loading required package: tsna

‘statnet’ 2019.6 (2019-06-13), part of the Statnet Project
* ‘news(package="statnet")’ for changes since last version
* ‘citation("statnet")’ for citation information
* ‘https://statnet.org’ for help, support, and other information
### Fast BCR clonal family inference
## 1. Data loading
# Path to the "COVID"/"HC" folder in the "example" folder in fastBCR project.
COVID_folder_path <- "example/COVID"
HC_folder_path <- "example/HC"
# Load files from "COVID_folder_path"/"HC_folder_path" into list.
# The storage format of data can be in "csv", "tsv", or "Rdata" format.
# The compressed files in the above storage format in "7zip", "cab", "cpio", "iso9660", "lha", "mtree", "shar", "rar", "raw", "tar", "xar", "zip", "warc" format can also be read in.
COVID_raw_data_list <- data.load(folder_path = COVID_folder_path, storage_format = "csv")

  |                                                                                                                                    
  |                                                                                                                              |   0%
  |                                                                                                                                    
  |=========================                                                                                                     |  20%
  |                                                                                                                                    
  |==================================================                                                                            |  40%
  |                                                                                                                                    
  |============================================================================                                                  |  60%
  |                                                                                                                                    
  |=====================================================================================================                         |  80%
  |                                                                                                                                    
  |==============================================================================================================================| 100%
You have loaded 5 samples
Sample 'COVID_01' contains 298430 sequences
Sample 'COVID_02' contains 326522 sequences
Sample 'COVID_03' contains 317836 sequences
Sample 'COVID_04' contains 294034 sequences
Sample 'COVID_05' contains 372852 sequences
HC_raw_data_list <- data.load(folder_path = HC_folder_path, storage_format = "csv")

  |                                                                                                                                    
  |                                                                                                                              |   0%
  |                                                                                                                                    
  |=========================                                                                                                     |  20%
  |                                                                                                                                    
  |==================================================                                                                            |  40%
  |                                                                                                                                    
  |============================================================================                                                  |  60%
  |                                                                                                                                    
  |=====================================================================================================                         |  80%
  |                                                                                                                                    
  |==============================================================================================================================| 100%
You have loaded 5 samples
Sample 'HC_01' contains 152349 sequences
Sample 'HC_02' contains 132252 sequences
Sample 'HC_03' contains 124646 sequences
Sample 'HC_04' contains 97251 sequences
Sample 'HC_05' contains 110364 sequences
## 2. Data preprocessing
# Preprocessing of raw data to meet the input requirements for clonal family inference.
# The input data needs to contain essential columns including "v_call" (V gene with or without allele), "j_call" (J gene with or without allele) and "junction_aa" (amino acid translation of the junction). 
# The "junction" (junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons) column is needed for phylogenetic tree construction and SHM-related analysis
# The "c_call" (constant region gene with or without allele) column is needed for isotype-related analysis.
# During data preprocessing, only productive sequences whose junction amino acid lengths between 9 and 26 are reserved.
# The "count_col_name" parameter represents the column name for the count of each sequence.
# It defaults to "NA" which means the original count of the sequence is not taken into account.
# Sequences with the same "v_call", "j_call" and "junction_aa" are considered to be the same clonotype and are merged into one row in processed data.
# The column "clonotype_count" is the number of sequences belonging to each clonotype.
# The column "clone_count" is the sum of the counts (calculated based on "count_col_name" parameter) of the sequences belonging to each clonotype.
# The column "clone_fre" is the frequency version of "clone_count".
# The information of the sequence with the highest count in each clonotype is retained.
# The list "index_match" saves the original indexes of sequences for each clonotype.
COVID_pro_data_list <- data.preprocess(data_list = COVID_raw_data_list, count_col_name = "consensus_count")

  |                                                                                                                                    
  |                                                                                                                              |   0%
  |                                                                                                                                    
  |=========================                                                                                                     |  20%
  |                                                                                                                                    
  |==================================================                                                                            |  40%
  |                                                                                                                                    
  |============================================================================                                                  |  60%
  |                                                                                                                                    
  |=====================================================================================================                         |  80%
  |                                                                                                                                    
  |==============================================================================================================================| 100%
You have preprocessed 5 samples
Sample 'COVID_01' contains 88502 unique clonotypes
Sample 'COVID_02' contains 101239 unique clonotypes
Sample 'COVID_03' contains 85374 unique clonotypes
Sample 'COVID_04' contains 108583 unique clonotypes
Sample 'COVID_05' contains 119386 unique clonotypes
HC_pro_data_list <- data.preprocess(data_list = HC_raw_data_list, count_col_name = "consensus_count")

  |                                                                                                                                    
  |                                                                                                                              |   0%
  |                                                                                                                                    
  |=========================                                                                                                     |  20%
  |                                                                                                                                    
  |==================================================                                                                            |  40%
  |                                                                                                                                    
  |============================================================================                                                  |  60%
  |                                                                                                                                    
  |=====================================================================================================                         |  80%
  |                                                                                                                                    
  |==============================================================================================================================| 100%
You have preprocessed 5 samples
Sample 'HC_01' contains 105364 unique clonotypes
Sample 'HC_02' contains 96186 unique clonotypes
Sample 'HC_03' contains 91897 unique clonotypes
Sample 'HC_04' contains 74450 unique clonotypes
Sample 'HC_05' contains 93030 unique clonotypes
## 3. BCR clonal family inference
# Fast clonal family inference from preprocessed data.
# The "cluster_thre" parameter represents minimal clustering criteria (minimum number of sequences needed to form a cluster) and defaults to 3.
# For high efficiency, the "cluster_thre" is increased by 1 for every 100,000 entries of input data.
# The "overlap_thre" parameter represents overlap coefficient threshold for merging two clusters, selectable range (0,1) and defaults to 0.1.
# Lower "overlap_thre" may lead to overclustering while higher thresholds may lead to the split of clonal families.
# The "consensus_thre" parameter represents the consensus score threshold for filtering candidates and defaults to 0.8.
# A higher "consensus_thre" means stricter inference of the cluster.
COVID_clusters_list <- data.BCR.clusters(pro_data_list = COVID_pro_data_list, cluster_thre = 3, overlap_thre = 0.1, consensus_thre = 0.8)

  |                                                                                                                                    
  |                                                                                                                              |   0%
  |                                                                                                                                    
  |=========================                                                                                                     |  20%
  |                                                                                                                                    
  |==================================================                                                                            |  40%
  |                                                                                                                                    
  |============================================================================                                                  |  60%
  |                                                                                                                                    
  |=====================================================================================================                         |  80%
  |                                                                                                                                    
  |==============================================================================================================================| 100%
You have clustered 5 samples
Sample 'COVID_01' contains 4903 clonal families
Sample 'COVID_02' contains 3899 clonal families
Sample 'COVID_03' contains 4614 clonal families
Sample 'COVID_04' contains 3145 clonal families
Sample 'COVID_05' contains 3303 clonal families
HC_clusters_list <- data.BCR.clusters(pro_data_list = HC_pro_data_list, cluster_thre = 3, overlap_thre = 0.1, consensus_thre = 0.8)

  |                                                                                                                                    
  |                                                                                                                              |   0%
  |                                                                                                                                    
  |=========================                                                                                                     |  20%
  |                                                                                                                                    
  |==================================================                                                                            |  40%
  |                                                                                                                                    
  |============================================================================                                                  |  60%
  |                                                                                                                                    
  |=====================================================================================================                         |  80%
  |                                                                                                                                    
  |==============================================================================================================================| 100%
You have clustered 5 samples
Sample 'HC_01' contains 799 clonal families
Sample 'HC_02' contains 2056 clonal families
Sample 'HC_03' contains 2305 clonal families
Sample 'HC_04' contains 1204 clonal families
Sample 'HC_05' contains 1624 clonal families
### Downstream analysis
## 1. Single cluster analysis
## 1.1 Conserved motifs distribution
# Visualization of multiple sequence alignment (MSA) of junction sequences within a cluster.
# The parameter "index" allows you to choose a cluster for visualization.
# The parameter "type" can be "AA" for amino acid or "DNA" for deoxyribonucleic acid.
# The parameter "raw_data" is the raw data which the clonal families inferred from.
# It is needed if you want to plot the MSA of 'DNA' sequences.
# fastBCR will retrieve all the DNA sequences, which can be multiple sequences due to the degeneracy of codons, that correspond to the amino acid sequence of each clonotype from the raw data
COVID_01_clusters <- COVID_clusters_list$COVID_01
HC_01_clusters <- HC_clusters_list$HC_01
msa.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "AA")
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

msa.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "DNA", raw_data = COVID_raw_data_list$COVID_01)
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

## 1.2 Sequence logo
# Visualization of sequence logo of junction sequences within a cluster.
seqlogo.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "AA")
Coordinate system already present. Adding new coordinate system, which will replace the existing one.

seqlogo.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "DNA", raw_data = COVID_raw_data_list$COVID_01)
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.Coordinate system already present. Adding new coordinate system, which will replace the existing one.

## 1.3 Evolutionary tree
# Reconstruct a B cell lineage tree with minimum spanning tree and genotype abundances using ClonalTree.
# The junction of BCR sequences within a cluster will be written as "ClonalFamily_index.fasta" in "ClonalTree/Examples/input" folder.
# ClonalTree uses Python for compilation, so it needs the absolute path of the Python interpreter in the "python_path" parameter.
# ClonalTree returns two files in the "ClonalTree/Examples/output" folder.
# ClonalFamily_index.nk: the reconstructed BCR lineage tree in newick format.
# ClonalFamily_index.nk.csv: a table in csv format, containing the parent relationship and cost.
clonal.tree.generation(bcr_clusters = COVID_01_clusters, index = 200, raw_data = COVID_raw_data_list$COVID_01, python_path = "/Users/wangkaixuan/anaconda3/bin/python") # Replace with your path
use default substitution matrix
Parameter setting = useAbundance:  False ; revision:  True ; trim: False
done
# You can use the clonal.tree.plot() function to visualize the evolutionary tree in R.
# The consensus sequence of the cluster is used as the root node of the tree.
# Orange points represent nodes and blue points represent tips.
# The x-axis shows the absolute genetic distance.
clonal.tree.plot(nk_path = "ClonalTree/Examples/output/ClonalFamily_200.abRT.nk")

## 1.4 Class switch recombination (CSR) network
# Visualization of isotype co-occurrence within clusters within a cluster
# Circle size represents the number of sequences carrying a given isotype.
# Lines connecting two circles indicate the enrichment level of observing switches in the two corresponding immunoglobulin subclasses.
# The enrichment level is the ratio of observed and expected switches if immunoglobulin isotypes are assumed to be independently distributed among cluster.
# Matrix of values of connected edges between clustered sequences in different isotypes is printed.
CSR.cluster.plot(bcr_clusters = COVID_01_clusters, index = 50)
      IGHM IGHD IGHG3 IGHG1 IGHA1 IGHG2 IGHG4 IGHE IGHA2
IGHM     0    0     0     1     1     0     0    0     0
IGHD     0    0     0     0     0     0     0    0     0
IGHG3    0    0     0     0     0     0     0    0     0
IGHG1    1    0     0     0     1     0     0    0     0
IGHA1    1    0     0     1     0     0     0    0     0
IGHG2    0    0     0     0     0     0     0    0     0
IGHG4    0    0     0     0     0     0     0    0     0
IGHE     0    0     0     0     0     0     0    0     0
IGHA2    0    0     0     0     0     0     0    0     0

## 2. Single sample/group analysis
## 2.1 Classification of clustered and unclustered sequences
# Merge all the clustered sequences in each sample into "clustered_seqs".
# Merge all the unclustered sequences in each sample into "unclustered_seqs".
COVID_seqs_list <- Clustered.seqs(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list)

  |                                                                                                                                    
  |                                                                                                                              |   0%
  |                                                                                                                                    
  |=========================                                                                                                     |  20%
  |                                                                                                                                    
  |==================================================                                                                            |  40%
  |                                                                                                                                    
  |============================================================================                                                  |  60%
  |                                                                                                                                    
  |=====================================================================================================                         |  80%
  |                                                                                                                                    
  |==============================================================================================================================| 100%
You have classified clustered and unclustered sequences for 5 samples
Sample 'COVID_01' contains 43862 clustered sequences and 44640 unclustered sequences
Sample 'COVID_02' contains 43901 clustered sequences and 57338 unclustered sequences
Sample 'COVID_03' contains 46536 clustered sequences and 38838 unclustered sequences
Sample 'COVID_04' contains 35007 clustered sequences and 73576 unclustered sequences
Sample 'COVID_05' contains 43627 clustered sequences and 75759 unclustered sequences
HC_seqs_list <- Clustered.seqs(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list)

  |                                                                                                                                    
  |                                                                                                                              |   0%
  |                                                                                                                                    
  |=========================                                                                                                     |  20%
  |                                                                                                                                    
  |==================================================                                                                            |  40%
  |                                                                                                                                    
  |============================================================================                                                  |  60%
  |                                                                                                                                    
  |=====================================================================================================                         |  80%
  |                                                                                                                                    
  |==============================================================================================================================| 100%
You have classified clustered and unclustered sequences for 5 samples
Sample 'HC_01' contains 5386 clustered sequences and 99978 unclustered sequences
Sample 'HC_02' contains 8582 clustered sequences and 87604 unclustered sequences
Sample 'HC_03' contains 9914 clustered sequences and 81983 unclustered sequences
Sample 'HC_04' contains 4946 clustered sequences and 69504 unclustered sequences
Sample 'HC_05' contains 6545 clustered sequences and 86485 unclustered sequences
## 2.2 Summary of clusters from a sample
# Summarize the number of clusters, the average size of clusters and the proportion of clustered sequences.
COVID_clusters_summary <- Clusters.summary(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list)
HC_clusters_summary <- Clusters.summary(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list)
COVID_01_summary <- COVID_clusters_summary$COVID_01
print('Summary of COVID_01:')
[1] "Summary of COVID_01:"
print(COVID_01_summary)
$`number of clusters`
[1] 4903

$`average size of clusters`
[1] 8.95

$`number of clustered seqs`
[1] 43862

$`number of all seqs`
[1] 88502

$`proportion of clustered sequences`
[1] "49.56%"
HC_01_summary <- HC_clusters_summary$HC_01
print('Summary of HC_01:')
[1] "Summary of HC_01:"
print(HC_01_summary)
$`number of clusters`
[1] 799

$`average size of clusters`
[1] 6.74

$`number of clustered seqs`
[1] 5386

$`number of all seqs`
[1] 105364

$`proportion of clustered sequences`
[1] "5.11%"
## 2.3 Visualization of clusters from a sample
# Point diagram showing clusters from a sample where a circle represents a cluster.
# The size and color of the circle represents the size of the cluster.
# The parameter "index" represents the index of the sample for visualization.
Clusters.visualization(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list, index = 1)

Clusters.visualization(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list, index = 1)

## 2.4 V/J gene usage
## 2.4.1 Pie chart: V/J gene
# Pie chart showing the gene usage frequency of clustered sequences.
# The top ten most frequent genes are shown, and the rest are represented by "others".
# The parameter "colname" can be "v_call" for V gene or "j_call" for J gene.
# (1) single sample
COVID_01_clustered_seqs <- COVID_seqs_list$clustered_seqs$COVID_01
COVID_01_unclustered_seqs <- COVID_seqs_list$unclustered_seqs$COVID_01
HC_01_clustered_seqs <- HC_seqs_list$clustered_seqs$HC_01
pie.freq.plot(clustered_seqs = COVID_01_clustered_seqs, colname = "v_call")

pie.freq.plot(clustered_seqs = HC_01_clustered_seqs, colname = "v_call")

# (2) multiple samples in a group.
# All the clustered/unclustered sequences from multiple samples in a group can be merged.
COVID_all_clustered_seqs <- NULL
for(i in 1:length(COVID_seqs_list$clustered_seqs)){
  COVID_all_clustered_seqs <- rbind(COVID_all_clustered_seqs, COVID_seqs_list$clustered_seqs[[i]])
}
COVID_all_unclustered_seqs <- NULL
for(i in 1:length(COVID_seqs_list$unclustered_seqs)){
  COVID_all_unclustered_seqs <- rbind(COVID_all_unclustered_seqs, COVID_seqs_list$unclustered_seqs[[i]])
}
HC_all_clustered_seqs <- NULL
for(i in 1:length(HC_seqs_list$clustered_seqs)){
  HC_all_clustered_seqs <- rbind(HC_all_clustered_seqs, HC_seqs_list$clustered_seqs[[i]])
}
HC_all_unclustered_seqs <- NULL
for(i in 1:length(HC_seqs_list$unclustered_seqs)){
  HC_all_unclustered_seqs <- rbind(HC_all_unclustered_seqs, HC_seqs_list$unclustered_seqs[[i]])
}
pie.freq.plot(clustered_seqs = COVID_all_clustered_seqs, colname = "v_call")

pie.freq.plot(clustered_seqs = HC_all_clustered_seqs, colname = "v_call")

## 2.4.2 Heatmap: V-J gene pair
# Heatmap showing the V-J gene pair frequency of clustered sequences
# (1) single sample.
vjpair.sample.plot(clustered_seqs = COVID_01_clustered_seqs)
Using Freq as value column: use value.var to override.

# (2) multiple samples in a group.
vjpair.sample.plot(clustered_seqs = COVID_all_clustered_seqs)
Using Freq as value column: use value.var to override.

vjpair.sample.plot(clustered_seqs = HC_all_clustered_seqs)
Using Freq as value column: use value.var to override.

## 2.5 Junction length distribution
# 2.5.1 Histogram and density plot showing the length distribution of junction amino acid of clustered sequences.
# (1) single sample
len.sample.plot(clustered_seqs = COVID_01_clustered_seqs)

# (2) multiple samples in a group
len.sample.plot(clustered_seqs = COVID_all_clustered_seqs)

# 2.5.2 Density ridges showing the length distributions of junction amino acid of clustered sequences and unclustered sequences.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
# (1) single sample
len.clustered.plot(clustered_seqs = COVID_01_clustered_seqs,
                   unclustered_seqs = COVID_01_unclustered_seqs)
[1] "p-value: <2e-16"

# (2) multiple samples in a group
len.clustered.plot(clustered_seqs = COVID_all_clustered_seqs,
                   unclustered_seqs = COVID_all_unclustered_seqs)
[1] "p-value: <2e-16"

## 2.6 Class switch recombination (CSR) network
# Visualization of isotype co-occurrence within clusters from a sample.
# Circle size represents the number of sequences carrying a given isotype.
# Lines connecting two circles indicate the enrichment level of observing switches in the two corresponding immunoglobulin subclasses.
# The enrichment level is the ratio of observed and expected switches if immunoglobulin isotypes are assumed to be independently distributed among cluster.
# Matrix of values of connected edges between clustered sequences in different isotypes is printed.
CSR.sample.plot(bcr_clusters = COVID_01_clusters)
      IGHM IGHD IGHG3 IGHG1 IGHA1 IGHG2 IGHG4 IGHE IGHA2
IGHM     0   91    39   465   256    79     1    0    51
IGHD    91    0     4    45    28     8     0    0     7
IGHG3   39    4     0    96    41    32     0    0     6
IGHG1  465   45    96     0   332   165     1    1    50
IGHA1  256   28    41   332     0    80     1    0   108
IGHG2   79    8    32   165    80     0     2    0    56
IGHG4    1    0     0     1     1     2     0    0     0
IGHE     0    0     0     1     0     0     0    0     0
IGHA2   51    7     6    50   108    56     0    0     0

CSR.sample.plot(bcr_clusters = HC_01_clusters)
      IGHM IGHD IGHG3 IGHG1 IGHA1 IGHG2 IGHG4 IGHE IGHA2
IGHM     0  241    11    25    58    51     1    0    36
IGHD   241    0     5    32    42    32     1    0    25
IGHG3   11    5     0    12    16    19     1    0     8
IGHG1   25   32    12     0    72    58     1    0    34
IGHA1   58   42    16    72     0    97     2    0    75
IGHG2   51   32    19    58    97     0     4    0    72
IGHG4    1    1     1     1     2     4     0    0     2
IGHE     0    0     0     0     0     0     0    0     0
IGHA2   36   25     8    34    75    72     2    0     0

### 2.7 Neutralizing antibody (NAb) sequence query
## 2.7.1 Load the public antibody database to get the information of neutralizing antibody (NAb) sequences.
# Here is an example of the Coronavirus Antibody Database (CoV-AbDab).
CoV_AbDab <- read.csv("example/CoV-AbDab_130623.csv")
## 2.7.2 NAb query
# Query the corresponding sequence from the public antibody database.
# The parameter "method" represents the CDRH3 matching method.
# It can be "NA" for perfect match, "hamming" for hamming distance or "lv" for Levenshtein distance. Defaults to "NA".
# The parameter "species" can be "Mouse" or "Human". Defaults to "Human".
# The parameter "maxDist" represents the maximum distance allowed for matching when the argument "method" is "hamming" or "lv". Defaults to "NA".
# example to perfect match
human_perfect_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = NA, maxDist = NA, species = "Human")
[1] "The query result has 26 rows."
head(human_perfect_match)
# example with perfect match in "Mouse" species
mouse_perfect_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = NA, maxDist = NA, species = "Mouse")
[1] "The query result is empty."
# example with fuzzy matching with "hamming" method and max distance of 1
human_hamming_1_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = "hamming", maxDist = 1, species = "Human")
[1] "The query result has 220 rows."
## 3. Inter group analysis
## 3.1 V/J gene usage
## 3.1.1 Boxplot: V/J gene
# Boxplot showing the V/J gene usage of the clustered sequences between two groups.
# The parameter "colname" can be "v_call" for V gene or "j_call" for J gene.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
gene.fre.plot(group1_seqs_list = COVID_seqs_list,
              group1_all_clustered_seqs = COVID_all_clustered_seqs,
              group1_label = "COVID-19",
              group2_seqs_list = HC_seqs_list,
              group2_all_clustered_seqs = HC_all_clustered_seqs,
              group2_label = "HC",
              colname = "v_call")

gene.fre.plot(group1_seqs_list = COVID_seqs_list,
              group1_all_clustered_seqs = COVID_all_clustered_seqs,
              group1_label = "COVID-19",
              group2_seqs_list = HC_seqs_list,
              group2_all_clustered_seqs = HC_all_clustered_seqs,
              group2_label = "HC",
              colname = "j_call")

## 3.1.2 Heatmap: V-J gene pair
# Heatmap showing the fold change of V-J gene pair frequency of clustered sequences between two groups.
# Log fold change (log FC) is calculated as the log2 ratio of the average values between group1 and group2 samples.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
# FDR correction was performed with the Benjamini–Hochberg procedure.
# The V-J gene pair frequency of samples with a frequency less than the minimum value (1‰) is set to the minimum value.
vjpair.group.plot(group1_seqs_list = COVID_seqs_list,
                  group1_all_clustered_seqs = COVID_all_clustered_seqs,
                  group1_label = "COVID-19",
                  group2_seqs_list = HC_seqs_list,
                  group2_all_clustered_seqs = HC_all_clustered_seqs,
                  group2_label = "HC")

## 3.2 Junction length
# Density ridges showing the length distribution of junction amino acid of clustered sequences between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
len.group.plot(group1_all_clustered_seqs = COVID_all_clustered_seqs, group1_label = "COVID-19",
               group2_all_clustered_seqs = HC_all_clustered_seqs, group2_label = "HC")
[1] "p-value: <2e-16"

## 3.3 Diversity analysis
## 3.3.1 Number and size of clusters
# Bubble plot showing the size and number of clusters between two groups.
clu.size.plot(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
              clusters_list2 = HC_clusters_list, group2_label = "HC")
[1] "p-value: 1.1e-15"

## 3.3.2 Tcf20 score
# Tcf20 score represents the proportion of sequences attributed to the top 20 clonal families out of the total number of BCR sequences.
# Boxplot showing the Tcf20 scores between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
Tcf20.plot(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
           clusters_list2 = HC_clusters_list, group2_label = "HC")
[1] "p-value: 0.0079"

## 3.4 Somatic hypermutation (SHM) analysis
# Calculate the average SHM ratio of all clusters in each sample.
# The calculation of SHM ratios may take a while.
SHM_df = SHM.calculation(clusters_list1 = COVID_clusters_list, raw_data_list1 = COVID_raw_data_list, group1_label = "COVID-19",
                         clusters_list2 = HC_clusters_list, raw_data_list2 = HC_raw_data_list, group2_label = "HC")
[1] "group1 processing:"

  |                                                                                                                                                                                               
  |                                                                                                                                                                                         |   0%
  |                                                                                                                                                                                               
  |=====================================                                                                                                                                                    |  20%
  |                                                                                                                                                                                               
  |==========================================================================                                                                                                               |  40%
  |                                                                                                                                                                                               
  |===============================================================================================================                                                                          |  60%
  |                                                                                                                                                                                               
  |====================================================================================================================================================                                     |  80%
  |                                                                                                                                                                                               
  |=========================================================================================================================================================================================| 100%[1] "group2 processing:"

  |                                                                                                                                                                                               
  |                                                                                                                                                                                         |   0%
  |                                                                                                                                                                                               
  |=====================================                                                                                                                                                    |  20%
  |                                                                                                                                                                                               
  |==========================================================================                                                                                                               |  40%
  |                                                                                                                                                                                               
  |===============================================================================================================                                                                          |  60%
  |                                                                                                                                                                                               
  |====================================================================================================================================================                                     |  80%
  |                                                                                                                                                                                               
  |=========================================================================================================================================================================================| 100%
# Boxplot showing the SHM ratios between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
SHM.plot(SHM_df = SHM_df)
[1] "p-value: 0.0079"

# Calculate the SHM ratios of clustered sequences in different isotypes in each sample.
# The calculation of SHM ratios may take a while.
SHM_iso_df = SHM.iso.calculation(clusters_list1 = COVID_clusters_list, raw_data_list1 = COVID_raw_data_list, group1_label = "COVID-19",
                                 clusters_list2 = HC_clusters_list, raw_data_list2 = HC_raw_data_list, group2_label = "HC")
[1] "group1 processing:"

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%[1] "group2 processing:"

  |                                                                                                                                                                  
  |                                                                                                                                                            |   0%
  |                                                                                                                                                                  
  |===============================                                                                                                                             |  20%
  |                                                                                                                                                                  
  |==============================================================                                                                                              |  40%
  |                                                                                                                                                                  
  |==============================================================================================                                                              |  60%
  |                                                                                                                                                                  
  |=============================================================================================================================                               |  80%
  |                                                                                                                                                                  
  |============================================================================================================================================================| 100%
# Boxplot showing the SHM ratios of clustered sequences in different isotypes between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
SHM.iso.plot(SHM_iso_df = SHM_iso_df)
 [1] "COVID-19_IGHD~COVID-19_IGHM p-value: 0.0079" "COVID-19_IGHD~COVID-19_IGHA p-value: 0.0079" "COVID-19_IGHD~COVID-19_IGHG p-value: 0.0079"
 [4] "COVID-19_IGHD~HC_IGHA p-value: 0.0159"       "COVID-19_IGHD~HC_IGHG p-value: 0.0079"       "COVID-19_IGHM~HC_IGHD p-value: 0.0079"      
 [7] "COVID-19_IGHM~HC_IGHM p-value: 0.0079"       "COVID-19_IGHM~HC_IGHA p-value: 0.0079"       "COVID-19_IGHM~HC_IGHG p-value: 0.0159"      
[10] "COVID-19_IGHA~HC_IGHD p-value: 0.0079"       "COVID-19_IGHA~HC_IGHM p-value: 0.0079"       "COVID-19_IGHA~HC_IGHA p-value: 0.0079"      
[13] "COVID-19_IGHA~HC_IGHG p-value: 0.0079"       "COVID-19_IGHG~HC_IGHD p-value: 0.0079"       "COVID-19_IGHG~HC_IGHM p-value: 0.0079"      
[16] "COVID-19_IGHG~HC_IGHA p-value: 0.0079"       "COVID-19_IGHG~HC_IGHG p-value: 0.0079"       "HC_IGHD~HC_IGHM p-value: 0.0159"            
[19] "HC_IGHD~HC_IGHA p-value: 0.0079"             "HC_IGHD~HC_IGHG p-value: 0.0079"             "HC_IGHM~HC_IGHA p-value: 0.0079"            
[22] "HC_IGHM~HC_IGHG p-value: 0.0079"            

### 3.5 NAb ratio calculation
## 3.5.1 Load the public antibody database to get the information of neutralizing antibody (NAb) sequences.
# Here is an example of the Coronavirus Antibody Database (CoV-AbDab).
CoV_AbDab <- read.csv("example/CoV-AbDab_130623.csv")

## 3.5.2 NAb ratio calculation
# NAb ratio is established as an indicator of the proportional prevalence of neutralizing antibody sequences within expanded clonalfamilies in each sample.
# It is defined as the fraction of the number of NAb sequences within clonal families to the total number of NAb sequences present in each sample.
# NAb sequences are queried using NAb.query() function where the parameter "method" represents the CDRH3 matching method.
# It can be "NA" for perfect match, "hamming" for hamming distance or "lv" for Levenshtein distance. Defaults to "NA".
# The parameter "species" can be "Mouse" or "Human". Defaults to "Human".
# The parameter "maxDist" represents the maximum distance allowed for matching when the argument "method" is "hamming" or "lv". Defaults to "NA".
NAb_ratio_df <- NAb.ratio.calculation(pro_data_list1 = COVID_pro_data_list, clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
                                      pro_data_list2 = HC_pro_data_list, clusters_list2 = HC_clusters_list, group2_label = "HC",
                                      AbDab = CoV_AbDab, method = NA, maxDist = NA, species = "Human")
[1] "group1 processing:"

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%[1] "group2 processing:"

  |                                                                                                                                                        
  |                                                                                                                                                  |   0%
  |                                                                                                                                                        
  |=============================                                                                                                                     |  20%
  |                                                                                                                                                        
  |==========================================================                                                                                        |  40%
  |                                                                                                                                                        
  |========================================================================================                                                          |  60%
  |                                                                                                                                                        
  |=====================================================================================================================                             |  80%
  |                                                                                                                                                        
  |==================================================================================================================================================| 100%
# Boxplot showing the NAb ratios between the two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
NAb.ratio.plot(NAb_ratio_df)
[1] "p-value: 0.025"

---
title: "fastBCR pipeline notebook"
output: html_notebook
---

```{r setup, include = TRUE}
# You should download and unzip "fastBCR-main.zip" from "https://github.com/ZhangLabTJU/fastBCR" and set it as working directory before running pipeline.
# knitr::opts_knit$set(root.dir = "THE/PATH/TO/FASTBCR/FOLDER")
knitr::opts_knit$set(root.dir = "~/Documents/Rpackage/fastBCR-main") # Replace with your path.
```

```{r}
library(fastBCR)
```

```{r}
### Fast BCR clonal family inference
## 1. Data loading
# Path to the "COVID"/"HC" folder in the "example" folder in fastBCR project.
COVID_folder_path <- "example/COVID"
HC_folder_path <- "example/HC"
# Load files from "COVID_folder_path"/"HC_folder_path" into list.
# The storage format of data can be in "csv", "tsv", or "Rdata" format.
# The compressed files in the above storage format in "7zip", "cab", "cpio", "iso9660", "lha", "mtree", "shar", "rar", "raw", "tar", "xar", "zip", "warc" format can also be read in.
COVID_raw_data_list <- data.load(folder_path = COVID_folder_path, storage_format = "csv")
HC_raw_data_list <- data.load(folder_path = HC_folder_path, storage_format = "csv")
```

```{r}
## 2. Data preprocessing
# Preprocessing of raw data to meet the input requirements for clonal family inference.
# The input data needs to contain essential columns including "v_call" (V gene with or without allele), "j_call" (J gene with or without allele) and "junction_aa" (amino acid translation of the junction). 
# The "junction" (junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons) column is needed for phylogenetic tree construction and SHM-related analysis.
# The "c_call" (constant region gene with or without allele) column is needed for isotype-related analysis.
# During data preprocessing, only productive sequences whose junction amino acid lengths between 9 and 26 are reserved.
# The "count_col_name" parameter represents the column name for the count of each sequence.
# It defaults to "NA" which means the original count of the sequence is not taken into account.
# Sequences with the same "v_call", "j_call" and "junction_aa" are considered to be the same clonotype and are merged into one row in processed data.
# The column "clonotype_count" is the number of sequences belonging to each clonotype.
# The column "clone_count" is the sum of the counts (calculated based on "count_col_name" parameter) of the sequences belonging to each clonotype.
# The column "clone_fre" is the frequency version of "clone_count".
# The information of the sequence with the highest count in each clonotype is retained.
# The list "index_match" saves the original indexes of sequences for each clonotype.
COVID_pro_data_list <- data.preprocess(data_list = COVID_raw_data_list, count_col_name = "consensus_count")
HC_pro_data_list <- data.preprocess(data_list = HC_raw_data_list, count_col_name = "consensus_count")
```

```{r}
## 3. BCR clonal family inference
# Fast clonal family inference from preprocessed data.
# The "cluster_thre" parameter represents minimal clustering criteria (minimum number of sequences needed to form a cluster) and defaults to 3.
# For high efficiency, the "cluster_thre" is increased by 1 for every 100,000 entries of input data.
# The "overlap_thre" parameter represents overlap coefficient threshold for merging two clusters, selectable range (0,1) and defaults to 0.1.
# Lower "overlap_thre" may lead to overclustering while higher thresholds may lead to the split of clonal families.
# The "consensus_thre" parameter represents the consensus score threshold for filtering candidates and defaults to 0.8.
# A higher "consensus_thre" means stricter inference of the cluster.
COVID_clusters_list <- data.BCR.clusters(pro_data_list = COVID_pro_data_list, cluster_thre = 3, overlap_thre = 0.1, consensus_thre = 0.8)
HC_clusters_list <- data.BCR.clusters(pro_data_list = HC_pro_data_list, cluster_thre = 3, overlap_thre = 0.1, consensus_thre = 0.8)
```

```{r, fig.height = 8, fig.width = 5}
### Downstream analysis
## 1. Single cluster analysis
## 1.1 Conserved motifs distribution
# Visualization of multiple sequence alignment (MSA) of junction sequences within a cluster.
# The parameter "index" allows you to choose a cluster for visualization.
# The parameter "type" can be "AA" for amino acid or "DNA" for deoxyribonucleic acid.
# The parameter "raw_data" is the raw data which the clonal families inferred from and defaults to "NA".
# It is needed if you want to plot the MSA of 'DNA' sequences.
# fastBCR will retrieve all the DNA sequences, which can be multiple sequences due to the degeneracy of codons, that correspond to the amino acid sequence of each clonotype from the raw data
COVID_01_clusters <- COVID_clusters_list$COVID_01
HC_01_clusters <- HC_clusters_list$HC_01
msa.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "AA")
msa.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "DNA", raw_data = COVID_raw_data_list$COVID_01)
```
```{r, fig.height = 2.5, fig.width = 14}
## 1.2 Sequence logo
# Visualization of sequence logo of junction sequences within a cluster.
seqlogo.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "AA")
seqlogo.plot(bcr_clusters = COVID_01_clusters, index = 200, type = "DNA", raw_data = COVID_raw_data_list$COVID_01)
```

```{r}
## 1.3 Evolutionary tree
# Reconstruct a B cell lineage tree with minimum spanning tree and genotype abundances using ClonalTree.
# The junction of BCR sequences within a cluster will be written as "ClonalFamily_index.fasta" in "ClonalTree/Examples/input" folder.
# ClonalTree uses Python for compilation, so it needs the absolute path of the Python interpreter in the "python_path" parameter.
# ClonalTree returns two files in the "ClonalTree/Examples/output" folder.
# ClonalFamily_index.nk: the reconstructed BCR lineage tree in newick format.
# ClonalFamily_index.nk.csv: a table in csv format, containing the parent relationship and cost.
clonal.tree.generation(bcr_clusters = COVID_01_clusters, index = 200, raw_data = COVID_raw_data_list$COVID_01, python_path = "/Users/wangkaixuan/anaconda3/bin/python") # Replace with your path
# You can use the clonal.tree.plot() function to visualize the evolutionary tree in R.
# The consensus sequence of the cluster is used as the root node of the tree.
# Orange points represent nodes and blue points represent tips.
# The x-axis shows the absolute genetic distance.
clonal.tree.plot(nk_path = "ClonalTree/Examples/output/ClonalFamily_200.abRT.nk")
```
```{r, fig.height = 4, fig.width = 5}
## 1.4 Class switch recombination (CSR) network
# Visualization of isotype co-occurrence within clusters within a cluster
# Circle size represents the number of sequences carrying a given isotype.
# Lines connecting two circles indicate the enrichment level of observing switches in the two corresponding immunoglobulin subclasses.
# The enrichment level is the ratio of observed and expected switches if immunoglobulin isotypes are assumed to be independently distributed among cluster.
# Matrix of values of connected edges between clustered sequences in different isotypes is printed.
CSR.cluster.plot(bcr_clusters = COVID_01_clusters, index = 50)
```

```{r}
## 2. Single sample/group analysis
## 2.1 Classification of clustered and unclustered sequences
# Merge all the clustered sequences in each sample into "clustered_seqs".
# Merge all the unclustered sequences in each sample into "unclustered_seqs".
COVID_seqs_list <- Clustered.seqs(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list)
HC_seqs_list <- Clustered.seqs(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list)
```

```{r}
## 2.2 Summary of clusters from a sample
# Summarize the number of clusters, the average size of clusters and the proportion of clustered sequences.
COVID_clusters_summary <- Clusters.summary(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list)
HC_clusters_summary <- Clusters.summary(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list)
COVID_01_summary <- COVID_clusters_summary$COVID_01
print('Summary of COVID_01:')
print(COVID_01_summary)
HC_01_summary <- HC_clusters_summary$HC_01
print('Summary of HC_01:')
print(HC_01_summary)
```

```{r}
## 2.3 Visualization of clusters from a sample
# Point diagram showing clusters from a sample where a circle represents a cluster.
# The size and color of the circle represents the size of the cluster.
# The parameter "index" represents the index of the sample for visualization.
Clusters.visualization(pro_data_list = COVID_pro_data_list, clusters_list = COVID_clusters_list, index = 1)
Clusters.visualization(pro_data_list = HC_pro_data_list, clusters_list = HC_clusters_list, index = 1)
```

```{r}
## 2.4 V/J gene usage
## 2.4.1 Pie chart: V/J gene
# Pie chart showing the gene usage frequency of clustered sequences.
# The top ten most frequent genes are shown, and the rest are represented by "others".
# The parameter "colname" can be "v_call" for V gene or "j_call" for J gene.
# (1) single sample
COVID_01_clustered_seqs <- COVID_seqs_list$clustered_seqs$COVID_01
COVID_01_unclustered_seqs <- COVID_seqs_list$unclustered_seqs$COVID_01
HC_01_clustered_seqs <- HC_seqs_list$clustered_seqs$HC_01
pie.freq.plot(clustered_seqs = COVID_01_clustered_seqs, colname = "v_call")
pie.freq.plot(clustered_seqs = HC_01_clustered_seqs, colname = "v_call")
# (2) multiple samples in a group.
# All the clustered/unclustered sequences from multiple samples in a group can be merged.
COVID_all_clustered_seqs <- NULL
for(i in 1:length(COVID_seqs_list$clustered_seqs)){
  COVID_all_clustered_seqs <- rbind(COVID_all_clustered_seqs, COVID_seqs_list$clustered_seqs[[i]])
}
COVID_all_unclustered_seqs <- NULL
for(i in 1:length(COVID_seqs_list$unclustered_seqs)){
  COVID_all_unclustered_seqs <- rbind(COVID_all_unclustered_seqs, COVID_seqs_list$unclustered_seqs[[i]])
}
HC_all_clustered_seqs <- NULL
for(i in 1:length(HC_seqs_list$clustered_seqs)){
  HC_all_clustered_seqs <- rbind(HC_all_clustered_seqs, HC_seqs_list$clustered_seqs[[i]])
}
HC_all_unclustered_seqs <- NULL
for(i in 1:length(HC_seqs_list$unclustered_seqs)){
  HC_all_unclustered_seqs <- rbind(HC_all_unclustered_seqs, HC_seqs_list$unclustered_seqs[[i]])
}
pie.freq.plot(clustered_seqs = COVID_all_clustered_seqs, colname = "v_call")
pie.freq.plot(clustered_seqs = HC_all_clustered_seqs, colname = "v_call")
```
```{r, fig.height = 8, fig.width = 3.2}
## 2.4.2 Heatmap: V-J gene pair
# Heatmap showing the V-J gene pair frequency of clustered sequences
# (1) single sample.
vjpair.sample.plot(clustered_seqs = COVID_01_clustered_seqs)
# (2) multiple samples in a group.
vjpair.sample.plot(clustered_seqs = COVID_all_clustered_seqs)
vjpair.sample.plot(clustered_seqs = HC_all_clustered_seqs)
```

```{r, fig.height = 2, fig.width = 4}
## 2.5 Junction length distribution
# 2.5.1 Histogram and density plot showing the length distribution of junction amino acid of clustered sequences.
# (1) single sample
len.sample.plot(clustered_seqs = COVID_01_clustered_seqs)
# (2) multiple samples in a group
len.sample.plot(clustered_seqs = COVID_all_clustered_seqs)
```
```{r, fig.height = 2, fig.width = 3.5}
# 2.5.2 Density ridges showing the length distributions of junction amino acid of clustered sequences and unclustered sequences.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
# (1) single sample
len.clustered.plot(clustered_seqs = COVID_01_clustered_seqs,
                   unclustered_seqs = COVID_01_unclustered_seqs)
# (2) multiple samples in a group
len.clustered.plot(clustered_seqs = COVID_all_clustered_seqs,
                   unclustered_seqs = COVID_all_unclustered_seqs)
```

```{r, fig.height = 4, fig.width = 5}
## 2.6 Class switch recombination (CSR) network
# Visualization of isotype co-occurrence within clusters from a sample.
# Circle size represents the number of sequences carrying a given isotype.
# Lines connecting two circles indicate the enrichment level of observing switches in the two corresponding immunoglobulin subclasses.
# The enrichment level is the ratio of observed and expected switches if immunoglobulin isotypes are assumed to be independently distributed among cluster.
# Matrix of values of connected edges between clustered sequences in different isotypes is printed.
CSR.sample.plot(bcr_clusters = COVID_01_clusters)
CSR.sample.plot(bcr_clusters = HC_01_clusters)
```

```{r}
### 2.7 Neutralizing antibody (NAb) sequence query
## 2.7.1 Load the public antibody database to get the information of neutralizing antibody (NAb) sequences.
# Here is an example of the Coronavirus Antibody Database (CoV-AbDab).
CoV_AbDab <- read.csv("example/CoV-AbDab_130623.csv")
## 2.7.2 NAb query
# We define that if a sequence in a cluster shares the same V and J genes with any sequence in the public antibody library and their CDRH3 matches, then this sequence is known as a NAb sequence. 
# The parameter "method" represents the CDRH3 matching method.
# It can be "NA" for perfect match, "hamming" for hamming distance or "lv" for Levenshtein distance. Defaults to "NA".
# The parameter "species" can be "Mouse" or "Human". Defaults to "Human".
# The parameter "maxDist" represents the maximum distance allowed for matching when the argument "method" is "hamming" or "lv". Defaults to "NA".
# example to perfect match
human_perfect_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = NA, maxDist = NA, species = "Human")
head(human_perfect_match)
# example with perfect match in "Mouse" species
mouse_perfect_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = NA, maxDist = NA, species = "Mouse")
# example with fuzzy matching with "hamming" method and max distance of 1
human_hamming_1_match <- NAb.query(bcr_clusters = COVID_01_clusters, AbDab = CoV_AbDab, method = "hamming", maxDist = 1, species = "Human")
```

```{r, fig.height = 2.5, fig.width = 12}
## 3. Inter group analysis
## 3.1 V/J gene usage
## 3.1.1 Boxplot: V/J gene
# Boxplot showing the V/J gene usage of the clustered sequences between two groups.
# The parameter "colname" can be "v_call" for V gene or "j_call" for J gene.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
gene.fre.plot(group1_seqs_list = COVID_seqs_list,
              group1_all_clustered_seqs = COVID_all_clustered_seqs,
              group1_label = "COVID-19",
              group2_seqs_list = HC_seqs_list,
              group2_all_clustered_seqs = HC_all_clustered_seqs,
              group2_label = "HC",
              colname = "v_call")
```

```{r, fig.height = 2.3, fig.width = 3.4}
gene.fre.plot(group1_seqs_list = COVID_seqs_list,
              group1_all_clustered_seqs = COVID_all_clustered_seqs,
              group1_label = "COVID-19",
              group2_seqs_list = HC_seqs_list,
              group2_all_clustered_seqs = HC_all_clustered_seqs,
              group2_label = "HC",
              colname = "j_call")
```

```{r, fig.height = 3.7, fig.width = 16}
## 3.1.2 Heatmap: V-J gene pair
# Heatmap showing the fold change of V-J gene pair frequency of clustered sequences between two groups.
# Log fold change (log FC) is calculated as the log2 ratio of the average values between group1 and group2 samples.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
# FDR correction was performed with the Benjamini–Hochberg procedure.
# The V-J gene pair frequency of samples with a frequency less than the minimum value (1‰) is set to the minimum value.
vjpair.group.plot(group1_seqs_list = COVID_seqs_list,
                  group1_all_clustered_seqs = COVID_all_clustered_seqs,
                  group1_label = "COVID-19",
                  group2_seqs_list = HC_seqs_list,
                  group2_all_clustered_seqs = HC_all_clustered_seqs,
                  group2_label = "HC")
```

```{r, fig.height = 2, fig.width = 3.5}
## 3.2 Junction length
# Density ridges showing the length distribution of junction amino acid of clustered sequences between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
len.group.plot(group1_all_clustered_seqs = COVID_all_clustered_seqs, group1_label = "COVID-19",
               group2_all_clustered_seqs = HC_all_clustered_seqs, group2_label = "HC")
```

```{r, fig.height = 3, fig.width = 5}
## 3.3 Diversity analysis
## 3.3.1 Number and size of clusters
# Bubble plot showing the size and number of clusters between two groups.
clu.size.plot(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
              clusters_list2 = HC_clusters_list, group2_label = "HC")
```

```{r, fig.height = 2, fig.width = 2.2}
## 3.3.2 Tcf20 score
# Tcf20 score represents the proportion of sequences attributed to the top 20 clonal families out of the total number of BCR sequences.
# Boxplot showing the Tcf20 scores between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
Tcf20.plot(clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
           clusters_list2 = HC_clusters_list, group2_label = "HC")
```
```{r}
## 3.4 Somatic hypermutation (SHM) analysis
# Calculate the average SHM ratio of all clusters in each sample.
# The calculation of SHM ratios may take a while.
SHM_df = SHM.calculation(clusters_list1 = COVID_clusters_list, raw_data_list1 = COVID_raw_data_list, group1_label = "COVID-19",
                         clusters_list2 = HC_clusters_list, raw_data_list2 = HC_raw_data_list, group2_label = "HC")
```


```{r, fig.height = 2, fig.width = 2.2}
# Boxplot showing the SHM ratios between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
SHM.plot(SHM_df = SHM_df)
```

```{r}
# Calculate the SHM ratios of clustered sequences in different isotypes in each sample.
# The calculation of SHM ratios may take a while.
SHM_iso_df = SHM.iso.calculation(clusters_list1 = COVID_clusters_list, raw_data_list1 = COVID_raw_data_list, group1_label = "COVID-19",
                                 clusters_list2 = HC_clusters_list, raw_data_list2 = HC_raw_data_list, group2_label = "HC")
```

```{r, fig.height = 2.5, fig.width = 5}
# Boxplot showing the SHM ratios of clustered sequences in different isotypes between two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
SHM.iso.plot(SHM_iso_df = SHM_iso_df)
```

```{r}
### 3.5 NAb ratio calculation
## 3.5.1 Load the public antibody database to get the information of neutralizing antibody (NAb) sequences.
# Here is an example of the Coronavirus Antibody Database (CoV-AbDab).
CoV_AbDab <- read.csv("example/CoV-AbDab_130623.csv")

## 3.5.2 NAb ratio calculation
# NAb ratio is established as an indicator of the proportional prevalence of neutralizing antibody sequences within expanded clonalfamilies in each sample.
# It is defined as the fraction of the number of NAb sequences within clonal families to the total number of NAb sequences present in each sample.
# NAb sequences are queried using NAb.query() function where the parameter "method" represents the CDRH3 matching method.
# It can be "NA" for perfect match, "hamming" for hamming distance or "lv" for Levenshtein distance. Defaults to "NA".
# The parameter "species" can be "Mouse" or "Human". Defaults to "Human".
# The parameter "maxDist" represents the maximum distance allowed for matching when the argument "method" is "hamming" or "lv". Defaults to "NA".
NAb_ratio_df <- NAb.ratio.calculation(pro_data_list1 = COVID_pro_data_list, clusters_list1 = COVID_clusters_list, group1_label = "COVID-19",
                                      pro_data_list2 = HC_pro_data_list, clusters_list2 = HC_clusters_list, group2_label = "HC",
                                      AbDab = CoV_AbDab, method = NA, maxDist = NA, species = "Human")

```

```{r, fig.height = 2, fig.width = 2.2}
# Boxplot showing the NAb ratios between the two groups.
# Statistical comparisons are carried out by the two-sided Wilcoxon rank-sum test.
NAb.ratio.plot(NAb_ratio_df)
```

